As a Public Relations Data Analyst, I constantly face an enormous amount of scattered media-monitoring data. For a while now, PR agencies (and, hopefully, their clients) have not set high requirements on metrics and measurement; but I believe that one day we will be urged to understand our daily media-monitoring datasets more deeply: to find hidden patterns that have never been figured out before.
At present, the point of a media-monitoring dataset is merely to report all the online news articles about our clients' brands and events. So, to be accountable, we have to include the URL of every article.
But here is the thing. We usually take the total number of articles to represent media coverage. However, if we set one extra parameter (one column) for the media name beforehand, or while building up the dataset, we can also compare coverage across media names afterwards. If not, we may never get deeper insights; we might just take the dataset as it is, limited to basic filtering in Excel or spreadsheets.
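Just to illustrate what that single extra column buys us (the table and column names here are hypothetical, not the actual dataset used later in this post):

library(dplyr)

# Hypothetical monitoring table that already carries a media_name column.
monitoring_data <- tibble::tribble(
  ~URL,                         ~media_name,
  "https://tekno.kompas.com/a", "kompas",
  "https://inet.detik.com/b",   "detik",
  "https://inet.detik.com/c",   "detik"
)

# With media_name in place, coverage per outlet is a one-liner.
monitoring_data %>%
  count(media_name, sort = TRUE)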
The DNS Hierarchy Example (src: educative.io)
To gain a better understanding, I suggest we take a look at this and this documentation. :)
Looking at the patterns of the domain/DNS hierarchy, I saw something that opens up possibilities. We can use Regular Expressions (Regex) to strip the http[s]:// prefix from each URL entry. With Regex, we can also split out every token separated by dots (.) or slashes (/).
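Before running this on the full dataset, here is a minimal sketch of the idea on a single made-up URL:

# A made-up URL, for illustration only.
url <- "https://desk.example.com/read/2021/08/11/some-article"

host <- sub("^https?://", "", url)   # drop the scheme
host <- strsplit(host, "/")[[1]][1]  # keep everything before the first slash
strsplit(host, "\\.")[[1]]           # split on literal dots
#> [1] "desk"    "example" "com"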
After cleaning the data with Regex, we can filter domain names and subdomains by running a few lines of code. I prefer the R Tidyverse packages because of their higher-level functions and parameters compared to other data-manipulation toolkits.
In the Public Relations scope, subdomains might be considered as a "Media Desk" or "Media Region". Taking The New York Times (http://nytimes.com) as an example, they have several media desks, namely: cooking.nytimes.com, parenting.nytimes.com, advertising.nytimes.com, cityandsuburban.nytimes.com, and so on. Likewise, Indonesian media tend to specify their subdomains based on media desks or district/city names. Take Tribunnews (http://tribunnews.com) as an example; we can find media regions such as: jabar.tribunnews.com (West Java), kaltim.tribunnews.com (East Borneo), jogja.tribunnews.com (Special Region of Yogyakarta), and so on.
To explain the data-manipulation workflow as simply as possible, this portfolio relies mainly on Regex cleansing and dplyr filtering. Regex cleansing helps us get rid of unwanted tokens like https and slashes. dplyr, in turn, is useful for finding subdomains and root domains; my goal is to retrieve and classify these two things to gain a deeper analysis of media desks.
Okayy… without further ado, let us dive into the execution!
library(dplyr)
library(googlesheets4)
library(tidyr)
library(stringi)
library(stringr)
library(purrr)
library(janitor)

I created a custom operator that works as the negation of %in%. I call it %out%.
Basically, %out% can be used inside dplyr::filter() to exclude all the values stated in the filter() call.
# initiating %out%: keeps every element of a that is NOT in b
`%out%` <- function(a, b) ! a %in% b
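A quick sanity check of the new operator, with made-up values:

c("com", "id", "net") %out% "com"
#> [1] FALSE  TRUE  TRUE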
gs4_deauth()
options(gargle_oauth_email = "gemforwork329@gmail.com")
dat_URL <-
read_sheet(
"https://docs.google.com/spreadsheets/d/1mKhCrBM5-uXZ_aRa72tCubgeSTZigipTG4DhLWr7t48/edit#gid=0"
)
dat_URL

Here are the articles from our client: a mobile-phone product launch and prelaunch. As we can see, we got 4,222 different URLs from various media names. Yep, only the URLs.
url_list <- dat_URL %>%
select(URL) %>%
as.list() %>%
unlist()

The http[s]:// pattern:

cleanurl3 <- gsub("http[s]?://", "", url_list)
cleanurl3[1:5]
#> URL1
#> "tekno.kompas.com/read/2021/08/11/21150047/harga-samsung-galaxy-z-fold-3-dan-z-flip-3-di-indonesia?page=all."
#> URL2
#> "www.idntimes.com/tech/gadget/alfonsus-adi-putra-2/produk-terbaru-yang-diluncurkan-dalam-samsung-galaxy-unpacked-2021-ini?q="
#> URL3
#> "www.antaranews.com/berita/2321406/samsung-luncurkan-galaxy-z-fold-3-dan-z-flip-3"
#> URL4
#> "www.jawapos.com/oto-dan-tekno/gadget/11/08/2021/samsung-rilis-ponsel-lipat-berikutnya-galaxy-z-fold-3-dan-z-flip-3-5g/"
#> URL5
#> "kumparan.com/kumparantech/samsung-galaxy-z-fold3-dan-z-flip3-resmi-rilis-ini-harganya-di-indonesia-1wJ1Ri6x0xL"
The www. pattern:

# escape the dot so it matches a literal "www." rather than "www" plus any character
cleanurl4 <- gsub("www\\.", "", cleanurl3)
cleanurl4[1:5]
#> URL1
#> "tekno.kompas.com/read/2021/08/11/21150047/harga-samsung-galaxy-z-fold-3-dan-z-flip-3-di-indonesia?page=all."
#> URL2
#> "idntimes.com/tech/gadget/alfonsus-adi-putra-2/produk-terbaru-yang-diluncurkan-dalam-samsung-galaxy-unpacked-2021-ini?q="
#> URL3
#> "antaranews.com/berita/2321406/samsung-luncurkan-galaxy-z-fold-3-dan-z-flip-3"
#> URL4
#> "jawapos.com/oto-dan-tekno/gadget/11/08/2021/samsung-rilis-ponsel-lipat-berikutnya-galaxy-z-fold-3-dan-z-flip-3-5g/"
#> URL5
#> "kumparan.com/kumparantech/samsung-galaxy-z-fold3-dan-z-flip3-resmi-rilis-ini-harganya-di-indonesia-1wJ1Ri6x0xL"
Splitting on slashes (/):

pre_result_regex <- sapply(strsplit(cleanurl4, "/"), "[", 1)
pre_result_regex[1:5]
#> URL1 URL2 URL3 URL4
#> "tekno.kompas.com" "idntimes.com" "antaranews.com" "jawapos.com"
#> URL5
#> "kumparan.com"
Splitting on dots (.):

result_regex <- as.list(strsplit(pre_result_regex, "\\."))
result_regex[1:5]
#> $URL1
#> [1] "tekno" "kompas" "com"
#>
#> $URL2
#> [1] "idntimes" "com"
#>
#> $URL3
#> [1] "antaranews" "com"
#>
#> $URL4
#> [1] "jawapos" "com"
#>
#> $URL5
#> [1] "kumparan" "com"
This is the end of the Regex cleansing. We can now convert the result_regex list into a data frame:

df_media <- as.data.frame(stringi::stri_list2matrix(result_regex, byrow = TRUE), stringsAsFactors = FALSE)
head(df_media, 30)
nrow(df_media)
#> [1] 4222

names(df_media) <- paste0("instance", seq_along(df_media))
head(df_media, 20)

I am interested in inspecting all the unique values of instance2 (supposing that many root-domain values sit inside the instance2 column). Here I am just guessing at the possibilities; however, it is better to leave the result as it is, since I want to explore further and gain deeper EDA insights as well.
instance2_unique <- df_media %>%
select(instance2) %>%
distinct()
instance2_unique

I got a hundred-ish values. We can see that there are still non-root-domain values in the instance2 column (e.g. com, id, co, web, my, net), which should instead be identified as top-level domains (TLDs). So we have gained more insight into the data pattern.
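To make that concrete, here is a quick spot check (the TLD list is my own guess, for illustration):

# Which instance2 values look like TLDs rather than media names?
tld_like <- c("com", "id", "co", "web", "my", "net")
instance2_unique %>%
  filter(instance2 %in% tld_like)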
What about the instance1 column?

instance1_unique <- df_media %>%
  select(instance1) %>%
  distinct()
instance1_unique

From inspecting both the instance1 and instance2 columns, we can infer that instance1 mostly contains plain domain names (namely: kumparan, beritasatu, cnnindonesia, antaranews), since I have already taken out all the www prefixes during the Regex cleansing.
So, I am pretty sure we can also get several media desks on three-instance domains (xxx.xxx.xxx), simply by filtering the instance2 column against instance1_unique. Through this operation, we can actually identify the domain names by themselves, thanks to the pattern of the data.
(Using instance1_unique$instance1 as the root-domain identifier.)
domain_three_instances <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
# filtering root domains (instance2) that contains matching values
# of instance1_unique$instance1
filter(instance2 %in% instance1_unique$instance1) %>%
filter(instance4 %in% NA) %>%
select(-instance4) %>%
# Taking out two instances domain that still remain
filter(instance3 %out% NA) %>%
distinct() %>%
arrange(instance2)
domain_three_instances

Bingo! We got our first filtered list of three-instance domains. While instance2 contains all the domain names (media names), we can also see how varied these media desks are (instance1). In the Public Relations scope, the variety of online media desks (a.k.a. subdomains) is one thing we cannot anticipate; reading them one by one takes a lot of time and may reduce one's working performance.
Through this method of filtering media domains, we can hopefully automate the process within a single R session. Making this code reproducible is most likely feasible.
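As a sketch of that reproducibility goal (this helper is my own addition, not used elsewhere in the post; it just repackages the filter above):

# A reusable version of the three-instance filter; `roots` is whatever
# vector of trusted root domains we pass in.
filter_three_instances <- function(df, roots) {
  df %>%
    select(instance1, instance2, instance3, instance4) %>%
    filter(instance2 %in% roots, is.na(instance4), !is.na(instance3)) %>%
    select(-instance4) %>%
    distinct() %>%
    arrange(instance2)
}

# e.g. filter_three_instances(df_media, instance1_unique$instance1)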
(Using instance2_unique$instance2 as the root-domain identifier; this may fetch domains with a single TLD.)
domain_three_instances_1 <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
# filtering root domains (instance2) that contains matching values
# of instance2_unique$instance2
filter(instance2 %in% instance2_unique$instance2) %>%
filter(instance4 %in% NA) %>%
select(-instance4) %>%
filter(instance3 %out% NA) %>%
distinct() %>%
# Negate our previous findings so that we can retrieve the new one
filter(instance2 %out% domain_three_instances$instance2) %>%
# Negate co, my, go, as they are supposed to be SLDs and TLDs
filter(instance2 %out% c("co", "my", "go")) %>%
arrange(instance2)

The next filter targets domains that contain both an SLD and a TLD (.my.id, .co.id, .com.my, etc.). In this case, the domain names should end up in the instance2 column; we need that consistent pattern to combine all the results later.
domain_three_instances_2 <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
filter(instance2 %in% instance2_unique$instance2) %>%
filter(instance4 %in% NA) %>%
select(-instance4) %>%
filter(instance3 %out% NA) %>%
distinct() %>%
filter(instance2 %out% domain_three_instances$instance2) %>%
filter(instance2 %in% c("co", "my", "sg", "de", "com", "gov", "go")) %>%
# Temporary colnames
rename(two = instance1) %>%
rename(three = instance2) %>%
rename(four = instance3) %>%
# Normalized colnames (instance2 as root domain)
rename(instance2 = two) %>%
rename(instance3 = three) %>%
rename(instance4 = four) %>%
arrange(instance2)
domain_three_instances_2

We can get all the domains with four instances (xxx.xxx.xxx.xxx) simply by dropping every row with a missing value in instance4 throughout the dataset.
domain_four_instances <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
tidyr::drop_na(instance4) %>%
distinct() %>%
arrange(instance2)
domain_four_instances

three_instances_final <- rbind(domain_three_instances, domain_three_instances_1)
three_and_fourins <- bind_rows(domain_four_instances, three_instances_final)

to_filter_twoins <- three_and_fourins %>%
  select(instance2) %>%
  distinct()

domain_two_instances_com <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
filter(instance3 %in% NA) %>%
select(-instance3) %>%
filter(instance2 %in% "com") %>%
distinct() %>%
rename(two = instance1) %>%
rename(three = instance2) %>%
rename(instance2 = two) %>%
rename(instance3 = three)
domain_two_instances_1 <- df_media %>%
select(instance1, instance2, instance3, instance4) %>%
filter(instance3 %in% NA) %>%
select(-instance3) %>%
filter(instance2 %out% to_filter_twoins$instance2) %>%
select(-instance4) %>%
distinct() %>%
rename(two = instance1) %>%
rename(three = instance2) %>%
rename(instance2 = two) %>%
rename(instance3 = three)
two_instances <- bind_rows(domain_two_instances_com, domain_two_instances_1)

all_instances <- bind_rows(three_and_fourins, domain_three_instances_1, domain_three_instances, domain_three_instances_2, domain_four_instances, two_instances)
all_instances

all_instances <- all_instances %>%
rename(root_domain = instance2) %>%
rename(sub_domain = instance1) %>%
rename(tld_1 = instance3) %>%
rename(sld = instance4) %>%
filter(root_domain %out% "") %>%
mutate(final_domain = purrr::pmap_chr(., ~ c(...) %>%
na.omit %>%
paste(collapse = "."))) %>%
mutate(medianame_root_domain = root_domain) %>%
arrange(medianame_root_domain)
all_instances <- all_instances %>%
distinct()

unique_medianames <- all_instances %>%
  select(final_domain, medianame_root_domain)
unique_medianames

cleanurl3 <- gsub("http[s]?://", "", url_list)
# escape the dot so only a literal "www." is removed
cleanurl4 <- gsub("www\\.", "", cleanurl3)
final_domain <- sapply(strsplit(cleanurl4,"/"),"[",1)
final_domain <- as.data.frame(final_domain)
final_domain <- final_domain %>%
select(final_domain) %>%
rename(cleaned_domain = final_domain)
final_domain$NO <- 1:nrow(final_domain)
unique_medianames <- unique_medianames %>%
select(final_domain, medianame_root_domain) %>%
rename(cleaned_domain = final_domain)

to_viz <- merge(final_domain, unique_medianames, by = "cleaned_domain", all.x = FALSE)
to_viz %>% arrange(NO)

We got 4,223 rows instead of 4,222. I suppose this condition was caused by the medianame_root_domain column, which probably got multiplied wherever identical values occur in the cleaned_domain column.
identify <- to_viz %>%
select(cleaned_domain, medianame_root_domain) %>%
distinct()
the_duplicated_one <-
identify %>% select(cleaned_domain) %>% janitor::get_dupes() %>% distinct()

identify %>%
  filter(cleaned_domain %in% the_duplicated_one$cleaned_domain)

Huft! That was probably a mistake in my filtering; it goes in my notes to figure out later. For now, though, the condition is not a fatal one. We can still take out the unwanted values in the medianame_root_domain column simply by negating them (in other words, ".com" should not be identified as a root domain).
Negating .com:

to_viz <- to_viz %>%
  filter(medianame_root_domain %out% "com")

to_viz <- to_viz %>%
  arrange(NO)
to_viz

Done! We got 4,222 entries, which matches our data input.
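A one-line sanity check makes that claim explicit (a defensive sketch, my addition):

# Stop early if the merge ever drifts from the original row count again.
stopifnot(nrow(to_viz) == nrow(dat_URL))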
To build the treemap, we use a special function called as.treemapDF. I got this lower-level R code from Stack Overflow (sorry, I forgot the link to the original answer), and I am going to feature it in the appendix as well.
count_domains <- to_viz %>%
select(cleaned_domain, medianame_root_domain) %>%
rename(`Media Outlet` = medianame_root_domain) %>%
rename(`URL Simplified` = cleaned_domain) %>%
group_by(`Media Outlet`, `URL Simplified`) %>%
summarise("values" = n())
media_treemap <- as.treemapDF(count_domains, valueCol = "values") %>% as.data.frame()
new_parent <- media_treemap$parents %>%
tidyr::replace_na(., "")
media_treemap <- bind_cols(media_treemap, new_parent) %>%
rename(new_parent = ...5)

library(plotly)
media_plotly_treemap <-
plot_ly(
data = media_treemap,
type = "treemap",
ids = ~ ids,
labels = ~ labels %>% stringr::str_wrap(width = 15),
parents = ~ new_parent,
values = ~ values,
domain = list(column = 2)
  ) %>%
  layout(title = list(text = "Media Desk Spread: Foldable Phone Product Launch and Prelaunch",
                      y = 0.98, x = 0.55, xanchor = 'center', yanchor = 'top')) %>%
  layout(uniformtext = list(minsize = 10)) %>%
  # colorway takes a plain character vector, not a formula
  layout(colorway = c("#005E7C", "#001242", "#632e60", "#162F20")) %>%
  config(displayModeBar = FALSE)

media_plotly_treemap

This is the Plotly treemap visualization generated from our data. It certainly depicts how varied the media names and media desks are.
From left to right, we can see the media names sorted by the amount of coverage (and, with it, the variety of their media desks). There are Detik, Kurio, Teknosignal, Kompas, Line (Line Today, the media aggregator), Siapgrak, Tribunnews, Pikiran-rakyat, Grid, and many more.
Speaking of coverage, it is represented by the tile area. What draws me to this treemap visualization is Plotly's interactivity, which makes it hoverable and clickable (and tidier compared to the other packages). I hope we can have fun exploring the interactivity of this plot.
I am interested in analyzing several media names, namely Detik, Kompas, Tribunnews, and Antaranews, since those media names show various media desks on the treemap.
varieties_and_numbers <-
to_viz %>%
group_by(cleaned_domain, medianame_root_domain) %>%
summarise("coverage" = n()) %>%
arrange(desc(coverage))

detik <-
  varieties_and_numbers %>%
  filter(medianame_root_domain %in% "detik")
detik
sum(detik$coverage)
#> [1] 145
The first media name we got here is Detik(.com). As visualized on the treemap, Detik has eight different media desks, namely inet.detik.com, wolipop.detik.com, finance.detik.com, travel.detik.com, hot.detik.com, health.detik.com, sport.detik.com, and 20.detik.com. In terms of coverage, Detik generates a significant article buzz, with 145 articles in total.
When it comes to the spread of media desks, Detik seems to put most articles on inet.detik.com (inet refers to "internet and all tech-related stuff"; 70 articles), followed by wolipop.detik.com (wolipop covers lifestyle content; 20 articles). These media-desk placements seem understandable, since all the articles discuss a smartphone product launch and prelaunch.
Detik's coverage was followed by Kurio and Teknosignal, but neither of them has media desks. So I am heading to the next one, which is Kompas.
varieties_and_numbers %>%
filter(medianame_root_domain %in% "kompas")By coverage, kompas published 105 articles on this product launch and
prelaunch campaign. Their mediadesks seems not so various, as it
featured mostly on tekno.kompas.com; followed by
kompas.tv, beta.kompas.tv, and
indeks.kompas.com.
The next data point worth analyzing is Tribunnews. This kind of media prefers to specify desks by region (city/province names). In this case, however, they tended to put our client's articles on tribunnews.com, which is their main domain.
varieties_and_numbers %>%
filter(medianame_root_domain %in% "tribunnews")I found something interesting on this media. Since,
Pikiran-rakyat has so many media desks; with
mantrasukabumi.pikiran-rakyat.com on the biggest coverage
value (13 Articles). One thing to analyze further; did they spread the
articles based on domain performance? Or is there any other
preference?
varieties_and_numbers %>%
filter(medianame_root_domain %in% "pikiran-rakyat")Grid also prefer to place our client’s articles on
nextren.grid.id (read: next trend), followed by
infokomputer.grid.id (read: the infos about
computer/gadgets). Their desk placements also seems reasonable.
varieties_and_numbers %>%
filter(medianame_root_domain %in% "grid")Antaranews has the same pattern as
Tribunnews. In coverages and desk varieties, they were not
the biggest one
varieties_and_numbers %>%
filter(medianame_root_domain %in% "antaranews")as.treemapDF <- function(DF, valueCol = NULL){
require(data.table)
colNamesDF <- names(DF)
if(is.data.table(DF)){
DT <- copy(DF)
} else {
DT <- data.table(DF, stringsAsFactors = FALSE)
}
DT[, root := ""]  # add a single root node so every branch shares one parent
colNamesDT <- names(DT)
if(is.null(valueCol)){
setcolorder(DT, c("root", colNamesDF))
} else {
setnames(DT, valueCol, "values", skip_absent=TRUE)
setcolorder(DT, c("root", setdiff(colNamesDF, valueCol), "values"))
}
hierarchyCols <- setdiff(colNamesDT, "values")
hierarchyList <- list()
# build one aggregation level per hierarchy column (root, root+col1, ...)
for(i in seq_along(hierarchyCols)){
currentCols <- colNamesDT[1:i]
if(is.null(valueCol)){
currentDT <- unique(DT[, ..currentCols][, values := .N, by = currentCols], by = currentCols)
} else {
currentDT <- DT[, lapply(.SD, sum, na.rm = TRUE), by=currentCols, .SDcols = "values"]
}
setnames(currentDT, length(currentCols), "labels")
hierarchyList[[i]] <- currentDT
}
hierarchyDT <- rbindlist(hierarchyList, use.names = TRUE, fill = TRUE)
parentCols <- setdiff(names(hierarchyDT), c("labels", "values", valueCol))
# a node's parent id is the " - "-joined path of its non-NA ancestor labels
hierarchyDT[, parents := apply(.SD, 1, function(x){fifelse(all(is.na(x)), yes = NA_character_, no = paste(x[!is.na(x)], sep = ":", collapse = " - "))}), .SDcols = parentCols]
# a node's own id is its parent path plus its label
hierarchyDT[, ids := apply(.SD, 1, function(x){paste(x[!is.na(x)], collapse = " - ")}), .SDcols = c("parents", "labels")]
hierarchyDT[, c(parentCols) := NULL]
return(hierarchyDT)
}

count_domains
media_treemap